Classification and Clustering

Annina Haverinen 19.11.2018

Loading the MASS “Boston” dataset already loaded in R. Link to dataset
This is a data set about Housing Values in Suburbs of Boston.

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. < em >J. Environ. Economics and Management < b >5, 81???102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) < em >Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley

Variables in the dataset:

  1. crim= per capita crime rate by town.
  2. zn = proportion of residential land zoned for lots over 25,000 sq.ft.
  3. indus = proportion of non-retail business acres per town.
  4. chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  5. nox = nitrogen oxides concentration (parts per 10 million).
  6. rm = average number of rooms per dwelling.
  7. age = proportion of owner-occupied units built prior to 1940.
  8. dis = weighted mean of distances to five Boston employment centres.
  9. rad = index of accessibility to radial highways.
  10. tax = full-value property-tax rate per $10,000.
  11. ptratio = pupil-teacher ratio by town.
  12. black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  13. lstat = lower status of the population (percent).
  14. medv = median value of owner-occupied homes in $1000s.
library(MASS) 
data("Boston")
dim(Boston) #506 observation of 14 variables
## [1] 506  14
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Names of the variables in the data
506 observation of 14 variables

gather(Boston) %>% 
  ggplot(aes(value)) + facet_wrap("key", scales= "free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Visualisation of the variables. the variable rm is the only one looking normally distribiuted, the others seem rather skewed.

Correlation plots

library(tidyr)
library("ggplot2")
library("GGally")
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library("corrplot")
## corrplot 0.84 loaded
cor_matrix <- cor(Boston)%>%round(2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
p1 <- ggcorr(cor_matrix,geom="circle")
# correlation plot matrix 
p1

?ggcorr
p2<- corrplot(cor_matrix, method = "circle",type="upper",cl.pos="b",tl.pos="d",tl.cex = 0.6)

p2
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

Looking at the correlations plot we can see that positively correlated variables are; rad (index of accesibility to radial highways), lstat (lower status of the population, percent) tax (full-value property-tax rate per $10,000) to crim (our variable of interest is crimerate per capita by town). Negatively associated with crim are medv (median value of owner-occupied homes in $1000s) and dis (weighted mean of distances to five Boston employment centres).

Scaling the dataset

scaled_boston <- scale(Boston)
summary(scaled_boston)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
class(scaled_boston)
## [1] "matrix"
scaled_boston <- as.data.frame(scaled_boston)
class(scaled_boston)
## [1] "data.frame"
summary(scaled_boston$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
crim_vector <- quantile(scaled_boston$crim) #new crime variable
crim_vector
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
crime <-cut(scaled_boston$crim, breaks = crim_vector, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
scaled_boston <- dplyr::select(scaled_boston, -crim)
scaled_boston <- data.frame(scaled_boston,crime)
n <- nrow(scaled_boston)
n
## [1] 506
train <- sample(n, size = n * 0.8)
train_set <- scaled_boston[train,] # training set with 80% of obs.
dim(train_set)
## [1] 404  14
test <- scaled_boston [-train,] # test set with 20% of obs.
dim(test)
## [1] 102  14
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Linear discriminant analysis

  • classification method to find the linare combination of the variables that separates the target variable classes.
  • in this exercis the target variable is crime-rate
lda1<-lda(crime~., data=train_set)
lda1
## Call:
## lda(crime ~ ., data = train_set)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2425743 0.2549505 0.2524752 
## 
## Group means:
##                  zn      indus         chas        nox         rm
## low       0.9642146 -0.9121498 -0.116404306 -0.8879627  0.4433907
## med_low  -0.1499109 -0.2752347 -0.071456607 -0.5576972 -0.1614688
## med_high -0.3690158  0.1267050  0.224586496  0.3787396  0.1191614
## high     -0.4872402  1.0171096 -0.002135914  1.1104597 -0.4659708
##                 age        dis        rad        tax     ptratio
## low      -0.8545322  0.9268423 -0.6964601 -0.7358577 -0.44959809
## med_low  -0.2736328  0.2806592 -0.5447506 -0.4809842 -0.06901381
## med_high  0.4066057 -0.3633938 -0.4745387 -0.3809150 -0.30234605
## high      0.8071096 -0.8593042  1.6382099  1.5141140  0.78087177
##               black       lstat        medv
## low       0.3789899 -0.76574214  0.51263076
## med_low   0.3164544 -0.12451730 -0.01743563
## med_high  0.1320563 -0.01255628  0.21547617
## high     -0.7429107  0.91525025 -0.68483433
## 
## Coefficients of linear discriminants:
##                 LD1         LD2          LD3
## zn       0.08840379  0.70557144 -0.917616680
## indus    0.08612616 -0.12471345  0.320382732
## chas    -0.02779654 -0.04817284  0.005687493
## nox      0.40469722 -0.70717805 -1.573586440
## rm      -0.04662213 -0.05333190 -0.238574268
## age      0.12086530 -0.26798321 -0.053459745
## dis     -0.06726274 -0.12188315 -0.080340692
## rad      4.01529318  0.96815220  0.007205622
## tax      0.06537236 -0.06683135  0.560167160
## ptratio  0.13465626  0.03802825 -0.453311859
## black   -0.07825554  0.03975242  0.110194110
## lstat    0.20234831 -0.29786104  0.239076638
## medv     0.09844462 -0.39112380 -0.240746769
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9644 0.0262 0.0094
classes <- as.numeric(train_set$crime) # 1,2,3,4 in stead of low,med_low,med_high, high

LDA biplot

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
plot(lda1, dimen = 2, col = classes, pch = classes)
lda.arrows(lda1, myscale = 1)

##Predict with LDA

# predict classes with test data
lda.pred <- predict(lda1, newdata = test)
# cross tabulate the results
table(correct = lda.pred$class, prediction = correct_classes)%>%addmargins()
##           prediction
## correct    low med_low med_high high Sum
##   low       19       9        0    0  28
##   med_low    7      16        4    0  27
##   med_high   0       3       15    0  18
##   high       0       0        4   25  29
##   Sum       26      28       23   25 102

Prediction was right in 56% of test set
high crime was prdicted in 16/18 cases
med_high crime was predicted in 18/24
high crime was predicted in 26/27

Clustering

library(MASS)
data("Boston")
scaled_boston <- scale(Boston)

K-means clustering

#determining the number of clusters
wss <- (nrow(scaled_boston)-1)*sum(apply(scaled_boston,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(scaled_boston, 
    centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
  ylab="Within groups sum of squares")

km<- kmeans(scaled_boston,3)
summary(km)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       42    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
km1 <- kmeans(scaled_boston, 5)
summary(km1)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       70    -none- numeric
## totss          1    -none- numeric
## withinss       5    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           5    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

3 clusters would be appropriate.

pairs(scaled_boston, col=km$cluster)

BONUS

data("Boston")
scaled_boston <- scale(Boston)
dim(scaled_boston)
## [1] 506  14
km_bonus <- kmeans(scaled_boston, 4)
class(scaled_boston)
## [1] "matrix"
scaled_boston <- as.data.frame(scaled_boston)
class(scaled_boston)
## [1] "data.frame"
lda2 <- lda(km_bonus$cluster~., data = scaled_boston)
lda2
## Call:
## lda(km_bonus$cluster ~ ., data = scaled_boston)
## 
## Prior probabilities of groups:
##          1          2          3          4 
## 0.46047431 0.25296443 0.20355731 0.08300395 
## 
## Group means:
##         crim         zn     indus        chas        nox          rm
## 1 -0.3770786 -0.3361581 -0.357024  0.01492717 -0.3627826 -0.05942108
## 2  0.3860352 -0.4872402  1.177101  0.09677408  1.1367747 -0.46615004
## 3 -0.4083200  1.5646181 -1.071144 -0.04298342 -1.0195629  0.88181624
## 4  1.9167567 -0.4872402  1.020131 -0.27232907  1.0484799 -0.41225609
##          age         dis        rad        tax    ptratio      black
## 1 -0.0510395  0.07344745 -0.5767039 -0.6112497 -0.1132745  0.2885204
## 2  0.8113429 -0.84147223  1.0055151  1.1534294  0.5524522  0.1238290
## 3 -1.2147329  1.24283836 -0.6005355 -0.6591517 -0.7342052  0.3535227
## 4  0.7894713 -0.89088476  1.6076484  1.4922585  0.7452908 -2.8449572
##        lstat        medv
## 1 -0.2079635  0.07175058
## 2  0.7316707 -0.52309197
## 3 -0.9453267  0.94512754
## 4  1.2421501 -1.12167260
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## crim     0.279080804 -0.79877867 -0.08683356
## zn       0.030140565 -0.33285057  1.26937453
## indus    0.824042075  0.43909568  0.42582133
## chas    -0.052691582  0.02323526 -0.02531932
## nox      0.880477101  0.52650678  0.88365117
## rm      -0.040434271 -0.22406650  0.36800546
## age     -0.045316679  0.24162605 -0.57110107
## dis     -0.047493769 -0.02490285  0.49487630
## rad      0.634634714  0.16879880  0.07074025
## tax      0.617515825  0.32440160  0.68401593
## ptratio  0.311095137  0.03084074  0.20476753
## black   -0.727324754  1.84389704  0.52511602
## lstat    0.220293087 -0.14348056  0.44187829
## medv     0.002872153 -0.12636659  0.57372717
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.6912 0.2028 0.1060
classes1 <- as.numeric(km_bonus$cluster)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
plot(lda2, dimen = 2, col = classes1, pch = classes1)
lda.arrows(lda2, myscale = 1)

##superbonus

model_predictors <- dplyr::select(train_set, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda1$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda1$scaling
matrix_product <- as.data.frame(matrix_product)
library("plotly")
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train_set$crime)